Deep Learning NYU, Week 5
- Gradient Descent
- Worst optimization method in the world
- Optimization problem
- minimize f(w) over w
- $w_{k+1} = w_k - \gamma_k \nabla f(w_k)$, where $\gamma_k$ is the step size
- Assumes f is continuous and differentiable – not true for typical networks (e.g. ReLU)
- actually only sub-differentiable
- "It should work; no theory to support this"
- Follow the direction of the negative gradient
- we look at the optimization landscape locally
- landscape = the loss surface over the domain of all the weights in the network
- find the best solution relative to where we are
- Consider a quadratic optimization problem
- positive definite case
- the gradient there is just the matrix times the distance from the solution
- with a good step size, the distance to the solution shrinks each step by a factor of roughly 1 - smallest eigenvalue / largest eigenvalue
- largest eigenvalue / smallest eigenvalue = condition number
- poorly conditioned – the condition number is very large; well conditioned – it is close to 1
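A minimal NumPy sketch of the gradient-descent update on a toy positive-definite quadratic (my own example, not from the lecture): the gradient is the matrix times the distance from the solution, and the eigenvalues control how fast the error shrinks.

```python
import numpy as np

# Toy quadratic: f(w) = 0.5 * w @ A @ w, minimum at w = 0.
# The gradient is A @ w, i.e. the matrix times the distance from the solution.
A = np.diag([1.0, 10.0])          # eigenvalues 1 and 10 -> condition number 10
w = np.array([1.0, 1.0])

step = 2.0 / (1.0 + 10.0)         # classical fixed step 2 / (lambda_min + lambda_max)
for k in range(50):
    grad = A @ w                  # gradient at w_k
    w = w - step * grad           # w_{k+1} = w_k - step * gradient

print(w)                          # close to the solution [0, 0]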
- Step sizes
- we don't have a good estimate of learning rate
- try a bunch of values on the log scale
- ideally choose an optimal step size
- we tend to choose the largest possible learning rate – at the edge of divergence
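One way to do the log-scale search, sketched with a stand-in final_loss function (a toy quadratic rather than a real training run, so this is only illustrative):

```python
import numpy as np

def final_loss(lr, steps=100):
    """Stand-in training run: gradient descent on a toy quadratic, returns the final loss."""
    A = np.diag([1.0, 10.0])
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * (A @ w)
    return 0.5 * w @ A @ w

# Try learning rates on a log scale; keep the largest one that doesn't diverge.
for lr in np.logspace(-4, 0, num=5):      # 1e-4, 1e-3, 1e-2, 1e-1, 1e0
    print(f"lr={lr:.0e}  final loss={final_loss(lr):.3e}")
```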
- Stochastic optimization
- Actually used to train nets in practice
- Replace gradient with a stochastic approximation to the gradient
- Gradient of the loss for a single instance
- instance chosen uniformly at random
- (the full loss is the sum of the per-instance losses f_i)
- the expected value of the SGD step direction is the full gradient
- useful to think of it as gd with noise
- Annealing
- neural network landscapes are bumpy
- the noise in SGD, in particular, helps it jump over these small bad minima
- good minima are larger and harder to skip
- Also valuable because
- we have a lot of redundancy
- SGD exploits the redundancy
- so each update can be thousands of times cheaper than full GD
- which makes it hard to justify using full GD instead
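A minimal NumPy sketch of SGD on a least-squares loss (toy data of my own, for illustration): the full loss is the sum of per-instance losses f_i, and each step uses the gradient of one instance chosen uniformly at random.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # toy inputs
y = X @ np.array([1., -2., 0.5, 3., 0.]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr = 0.01
for step in range(10_000):
    i = rng.integers(len(X))                         # instance chosen uniformly at random
    grad_i = (X[i] @ w - y[i]) * X[i]                # gradient of f_i(w) = 0.5 * (x_i.w - y_i)^2
    w -= lr * grad_i                                 # in expectation this is the full gradient

print(np.round(w, 2))                                # near the true weights, up to SGD noise
```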
- Minibatching
- use batches randomly chosen
- practical reasons are overwhelming
- much more efficient utilization of hardware
- e.g. ImageNet training uses batch sizes of 64
- distributed training
- "ImageNet in one hour"
- Full batch
- do not use gradient descent
- LBFGS
- 50 years of optim research
- scipy has a bulletproof implementation
- on CPU, batch size isn't as critical for hardware utilization
- Always try mini-batching
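For a genuinely deterministic, full-batch problem, the scipy implementation mentioned above can be used like this; the Rosenbrock function is just a stand-in objective with a known gradient.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS on a deterministic objective (Rosenbrock as a stand-in), with its exact gradient.
x0 = np.zeros(5)
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)          # approximately [1, 1, 1, 1, 1], the minimizer
print(res.nit)        # number of iterations used
```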
- Momentum
- trick to always use with SGD
- adds a momentum parameter (β) to the update rule
- $w_{k+1} = w_k - \gamma_k \nabla f(w_k) + \beta_k (w_k - w_{k-1})$
- equivalent form: update both p and w – damp the old momentum and add the new gradient (sketched in code at the end of this section)
- p is an accumulated gradient buffer – past gradients are exponentially damped – a running sum of gradients
- SGD with momentum just uses the stochastic gradient in place of the full gradient
- "Stochastic heavy ball method"
- momentum keeps pushing the update in the same direction instead of making dramatic changes of direction
- small beta – can change direction more quickly; high beta makes it harder to turn
- high beta helps dampen oscillations
- β = 0.9 or 0.99 almost always works well
- momentum also effectively increases the step size (applied to the accumulated past gradients)
- the effective step size is scaled by roughly 1/(1 - β), so the learning rate may need to be reduced to compensate
- why it works
- acceleration contributes to performance
- Nesterov did a lot of the research on acceleration
- Acceleration
- Noise smoothing
- momentum averages the gradients over time
- this smoothing means the averaged iterates become a good approximation to the solution
- reduces the bouncing around
- plain SGD already works well when the problem is well conditioned
- the smoothing matters most when it is poorly conditioned
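A minimal NumPy sketch of the momentum update in the p/w buffer form above, on the same kind of toy quadratic as before; this is the same rule torch.optim.SGD implements when you pass momentum=0.9.

```python
import numpy as np

A = np.diag([1.0, 10.0])              # toy positive-definite quadratic
w = np.array([1.0, 1.0])
p = np.zeros_like(w)                  # accumulated gradient buffer

gamma, beta = 0.1, 0.9
for k in range(200):
    grad = A @ w                      # a stochastic gradient would go here for SGD
    p = beta * p + grad               # damp the old momentum, add the new gradient
    w = w - gamma * p                 # accumulated gradients scale the step by ~1/(1 - beta)

print(w)                              # close to the solution [0, 0]
```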
- Adaptive methods
- maintain a separate learning-rate estimate for each weight
- lots of different ways to do this
- smaller learning rates for weights later in the network, larger in the early weights
- fairly hand-wavy
- RMSProp
- normalize the update by the root mean square of recent gradients (an exponential moving average of the squared gradients)
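The standard RMSProp update, which the missing figure here presumably showed, sketched as a single step (alpha is the moving-average constant, eps avoids division by zero):

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=1e-3, alpha=0.99, eps=1e-8):
    """One RMSProp step: normalize the update by the RMS of recent gradients."""
    v = alpha * v + (1 - alpha) * grad**2        # running mean of squared gradients
    w = w - lr * grad / (np.sqrt(v) + eps)       # per-weight normalized step
    return w, v
```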
- ADAM: Adaptive moment estimation
- bias correction in full Adam boosts the moment estimates during the early steps, when the running averages are still near zero
- Occasionally doesn't converge
- Poorly understood
- Has worse generalization error
- Small neural networks will have different results depending on initial values
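The standard Adam update with bias correction, sketched the same way (in place of the missing figure above); dividing by (1 - β^t) is what boosts the moment estimates in the early steps:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: RMSProp-style normalization plus momentum, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # second moment (RMS)
    m_hat = m / (1 - beta1**t)                   # bias correction, t starts at 1
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```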
- Normalization layers
- Linear -> norm -> activation or
- Conv -> norm -> ReLu
- They don't make the network more powerful
- a whitening (standardizing) operation applied to the activations
- with some additional parameters so the outputs can still take any range of values
- adds more parameters to the layer: learnable scaling and bias term
- y = a / stddev * (x - mean) + b
- often they reverse the parametrization
- a & b move slowly as they're learned
- Batch norm
- bizarre, but works very well
- normalize across batch
- estimates mean and stddev across all instances in a mini batch
- breaks the assumptions behind SGD theory, since instances within a batch are no longer processed independently
- layer norm, instance norm, and group norm are other normalizations that work
- group norm works where batch norm works
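A sketch of the y = a / stddev * (x - mean) + b operation for batch norm, with the mean and std estimated across the instances of the mini-batch; roughly speaking, which axes the statistics are computed over is what distinguishes batch, layer, instance, and group norm.

```python
import numpy as np

def batch_norm_forward(x, a, b, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                        # per-feature mean over the mini-batch
    std = x.std(axis=0)                          # per-feature std over the mini-batch
    return a * (x - mean) / (std + eps) + b      # y = a / std * (x - mean) + b
```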
- Why does normalization help?
- the network becomes easier to optimize, so larger learning rates can be used
- adds noise, which helps with generalization
- makes weight initialization less important
- allows plugging together multiple layers with impunity
- allows for automated architecture search
- (without normalization, stacking arbitrary layers typically resulted in a poorly conditioned network)
- have to backpropagate through the calculation of the mean and stddev
- for batch/instance norm: mean/std are fixed after training
- group/layer can update the values
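The Conv -> norm -> ReLU pattern sketched in PyTorch; after training, calling .eval() is what makes batch norm switch from batch statistics to its fixed running mean/std.

```python
import torch
from torch import nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),  # bias is redundant before a norm
    nn.BatchNorm2d(16),                                       # learnable scale (a) and shift (b)
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)       # mini-batch of 8 images
y = block(x)                        # training mode: normalizes with batch statistics

block.eval()                        # inference: uses the fixed running mean/std instead
y = block(x)
```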
- Death of optimization
- try to use a big neural network to solve the optimization problem
- Practicum
- Convolution output dimensions: with stride 1 and no padding, an input of length n and a kernel of size k give an output of length n - k + 1; with m kernels the output is (n - k + 1) by m
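A quick check of the formula with PyTorch, reading m as the number of kernels (output channels), which is an assumption on my part:

```python
import torch
from torch import nn

n, k, m = 100, 5, 8
conv = nn.Conv1d(in_channels=1, out_channels=m, kernel_size=k)   # stride 1, no padding
x = torch.randn(1, 1, n)            # (batch, channels, length)
y = conv(x)
print(y.shape)                      # torch.Size([1, 8, 96]): m channels of length n - k + 1
```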